This worksheet covers the concepts from the second half of Module 1: Exploratory Data Analysis in Two Dimensions. It should take no more than 20-30 minutes to complete. Please raise your hand if you get stuck.
There are many ways to accomplish the tasks you are presented with; however, you will find that by using the techniques covered in class, the exercises should be relatively simple.
For this exercise, we will be using pandas, NumPy, and matplotlib:
In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
In the /data/ folder, you will find a series of .json files called dataN.json, numbered 1-4. Each file contains the following data:
| | birthday | first_name | last_name |
|---|---|---|---|
| 0 | 5/3/67 | Robert | Hernandez |
| 1 | 8/4/84 | Steve | Smith |
| 2 | 9/13/91 | Anne | Raps |
| 3 | 4/15/75 | Alice | Muller |
Using the .read_json() function and its various configuration options, read each of these files into a DataFrame. The documentation is available here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html.
In [27]:
df1 = pd.read_json('../../data/data1.json')
df1
Out[27]:
In [8]:
df2 = pd.read_json('../../data/data2.json', orient='index')
df2
Out[8]:
In [30]:
df3 = pd.read_json('../../data/data3.json', orient='columns')
df3
Out[30]:
In [31]:
df4 = pd.read_json('../../data/data4.json', orient='split')
df4
Out[31]:
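Since all four files are supposed to contain the same records, a quick sanity check is to normalize row and column order and compare the results. The snippet below is a sketch; it assumes the four DataFrames above loaded successfully and that the birthday values were read as plain strings in every case.

In [ ]:
#Sanity check (sketch): every file should yield the same table once row and
#column order are normalized
cols = ['birthday', 'first_name', 'last_name']
reference = df1[cols].sort_values(cols).reset_index(drop=True)
for df in (df2, df3, df4):
    assert reference.equals(df[cols].sort_values(cols).reset_index(drop=True))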
In the data folder, there is a webserver log file called hackers-access.httpd. For this exercise, you will use this file to answer the following questions:

- What are the top 10 most active operating systems?
- What are the top 10 most active browsers?

In order to accomplish this task, do the following:

1. Read in and parse the log file using the apache_log_parser module.
2. Extract the operating system and browser from each user-agent string using the user_agents module, the documentation for which is available here: (https://pypi.python.org/pypi/user-agents)
In [6]:
import apache_log_parser
from user_agents import parse

#Read in the log file and parse each line into a dict of fields
line_parser = apache_log_parser.make_parser("%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"")

parsed_server_data = []
with open("../../data/hackers-access.httpd", "r") as server_log:
    for line in server_log:
        parsed_server_data.append(line_parser(line))

server_df = pd.DataFrame(parsed_server_data)
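Before deriving new columns, it is worth confirming that the parser produced the expected field: apache_log_parser stores the raw user-agent string under the key request_header_user_agent, which the apply() calls below rely on.

In [ ]:
#Inspect the parsed fields; the user-agent string should appear as
#'request_header_user_agent'
sorted(server_df.columns)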
In [11]:
#Write the functions: map a raw user-agent string to its OS / browser family
def get_os(x):
    user_agent = parse(x)
    return user_agent.os.family

def get_browser(x):
    user_agent = parse(x)
    return user_agent.browser.family
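To see what these helpers return, you can try them on a sample user-agent string. The string below is purely illustrative (it is not taken from the log file), and the exact family names depend on the installed user_agents version.

In [ ]:
#Illustrative user-agent string (not from the log file)
sample_ua = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36")
print(get_os(sample_ua))       # e.g. "Windows"
print(get_browser(sample_ua))  # e.g. "Chrome"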
In [12]:
#Apply the functions to the dataframe
server_df['os'] = server_df['request_header_user_agent'].apply(get_os)
server_df['browser'] = server_df['request_header_user_agent'].apply(get_browser)
In [13]:
#Get the top 10 values
server_df['os'].value_counts().head(10)
Out[13]:
In [14]:
server_df['browser'].value_counts().head(10)
Out[14]:
Using the dailybots.csv file, read the file into a DataFrame and perform the following operations:

- Filter the data down to the Government/Politics industry.
- For that industry, compute the total number of hosts seen for each bot family (botfam).

You may find it helpful to use the groupby() function, which is documented here: (http://pandas.pydata.org/pandas-docs/stable/groupby.html)
In [20]:
bots = pd.read_csv('../../data/dailybots.csv')
bots.head()
Out[20]:
In [24]:
gov_bots = bots.loc[bots['industry'] == "Government/Politics", ['botfam', 'hosts']]
In [26]:
gov_bots.groupby('botfam', as_index=False).sum()
Out[26]:
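The filter and aggregation can also be chained into a single expression; this sketch should produce the same result as the two cells above.

In [ ]:
#Equivalent one-liner: filter to the industry, group by bot family, and sum
#the host counts
(bots[bots['industry'] == "Government/Politics"]
     .groupby('botfam', as_index=False)['hosts']
     .sum())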